Destinations v2: Do not dedup raw table #31520

Merged
edgao merged 15 commits into master from edgao/skip_raw_dedup
Oct 25, 2023

@edgao edgao commented Oct 17, 2023

closes #30710

Stop deduping the raw table. This necessitates some changes to how T+D (typing and deduping) works overall. There are also a huge number of test changes to fix the raw table expectations. Many of these are purely organizational, e.g. combining the dedup/nondedup raw expected-records files, since they're now identical.

The final tables remain (almost) identical to the previous behavior. The only difference is in how we handle typing failures in the _ab_cdc_deleted_at column: a CDC deletion is now executed only if the cast succeeds. Given that connectors generate that column internally (i.e. we have perfect control over it), I don't think that matters.

previous T+D logic:

  1. insert new raw records into the final table (where loaded_at is null OR deleted_at is not null)
  2. delete duplicate records from final table (where row_number != 1)
  3. delete duplicate records from raw table (where raw_id not in final_table)
  4. CDC deletes (where raw_id not in (select from raw_table where deleted_at is not null))
  5. commit raw table

new logic:

  1. insert new raw records into the final table (where (loaded_at is null OR deleted_at is not null) AND row_number = 1)
    1. These are now deduped before they even make it to the final table. This isn't strictly necessary yet, but it will be needed for the merge-based T+D implementation (where each existing final row must match at most one new raw row).
  2. delete duplicate records from final table (where row_number != 1)
  3. CDC deletes (where deleted_at is not null)
    1. This is a significant change from the old logic. Because we now keep all the old raw records, deleting every final record that has a deletion record somewhere in the raw table is no longer valid.
    2. But we still behave correctly. The old deletion tombstone records are reinserted in step 1 on every T+D invocation, so if the deletion record is the most recent record for its key, we'll still correctly trigger the delete.
    3. This does mean we now only execute CDC deletions if the cast call is successful.
  4. commit raw table
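
The four steps above can be sketched in memory as follows. This is only an illustration, not the connector's SQL: the class `TypingDedupingSketch`, the `Raw` type, and its fields (`pk`, `cursor`, `deleted`, `loaded`) are hypothetical stand-ins for the primary key, the cursor, `deleted_at is not null`, and `loaded_at is not null`. It assumes `row_number = 1` means "highest cursor per primary key" and it ignores typing/cast failures entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TypingDedupingSketch {

  /** Hypothetical, simplified record; real raw tables hold JSON blobs. */
  static final class Raw {
    final int pk;          // primary key
    final long cursor;     // ordering column used for deduping
    final boolean deleted; // stands in for "deleted_at is not null"
    boolean loaded;        // stands in for "loaded_at is not null"

    Raw(int pk, long cursor, boolean deleted) {
      this.pk = pk;
      this.cursor = cursor;
      this.deleted = deleted;
    }
  }

  /** Keeps, per primary key, only the record with the highest cursor (row_number = 1). */
  static List<Raw> latestPerKey(List<Raw> records) {
    Map<Integer, Raw> latest = new HashMap<>();
    for (Raw r : records) {
      latest.merge(r.pk, r, (a, b) -> b.cursor > a.cursor ? b : a);
    }
    return new ArrayList<>(latest.values());
  }

  /** Runs one T+D pass. The raw table is never deduped; the final table is updated in place. */
  static void run(List<Raw> rawTable, List<Raw> finalTable) {
    // Step 1: pick unloaded records plus all tombstones, deduped before insertion.
    List<Raw> candidates = new ArrayList<>();
    for (Raw r : rawTable) {
      if (!r.loaded || r.deleted) {
        candidates.add(r);
      }
    }
    finalTable.addAll(latestPerKey(candidates));

    // Step 2: delete duplicates from the final table.
    List<Raw> deduped = latestPerKey(finalTable);
    finalTable.clear();
    finalTable.addAll(deduped);

    // Step 3: CDC deletes, based only on rows that survived deduping.
    finalTable.removeIf(r -> r.deleted);

    // Step 4: commit the raw table by marking every record as loaded.
    for (Raw r : rawTable) {
      r.loaded = true;
    }
  }
}
```

In this model, a tombstone that is the newest record for its key keeps that key out of the final table on every pass, because it is reinserted in step 1 and removed again in step 3. A later record for the same key with a higher cursor wins in step 2, so the stale tombstone no longer triggers a delete.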

(upcoming merge logic: replace steps 1, 2, and 3 with a single merge statement that does an upsert+CDC deletes in one step)
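
A rough way to picture the single-statement merge is the in-memory sketch below. It is an assumption about the general shape of an upsert-plus-delete merge, not the upcoming implementation: `MergeSketch`, `Row`, and `merge` are hypothetical names, and the comparison on `cursor` is assumed. It also shows why the new rows must already be deduped: each existing final row may match at most one new row, so duplicates in the input are rejected.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class MergeSketch {

  /** Hypothetical row shape: primary key, cursor, and a CDC deletion flag. */
  static final class Row {
    final int pk;
    final long cursor;
    final boolean deleted;

    Row(int pk, long cursor, boolean deleted) {
      this.pk = pk;
      this.cursor = cursor;
      this.deleted = deleted;
    }
  }

  /**
   * Applies already-deduped new rows to the final table (keyed by primary key) in one pass.
   * Throws if two new rows share a primary key, mirroring the "at most one match" constraint.
   */
  static void merge(Map<Integer, Row> finalTable, List<Row> newRows) {
    Set<Integer> seen = new HashSet<>();
    for (Row r : newRows) {
      if (!seen.add(r.pk)) {
        throw new IllegalArgumentException("more than one new row for pk " + r.pk);
      }
    }
    for (Row incoming : newRows) {
      Row existing = finalTable.get(incoming.pk);
      boolean incomingIsNewer = existing == null || incoming.cursor > existing.cursor;
      if (!incomingIsNewer) {
        continue; // stale record: leave the existing final row untouched
      }
      if (incoming.deleted) {
        finalTable.remove(incoming.pk); // CDC delete
      } else {
        finalTable.put(incoming.pk, incoming); // insert or update
      }
    }
  }
}
```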

@octavia-squidington-iii octavia-squidington-iii added the area/connectors Connector related issues label Oct 17, 2023
@edgao edgao force-pushed the edgao/skip_raw_dedup branch from 3a28809 to 0ae0e67 Compare October 17, 2023 22:48
- assertAll(
-     () -> assertEquals("bar", actualRawRecords.get(0).get("_airbyte_data").get("string").asText()),
-     () -> assertEquals("bar", actualFinalRecords.get(0).get(generator.buildColumnId("string").name()).asText()));
+ assertEquals("bar", actualFinalRecords.get(0).get(generator.buildColumnId("string").name()).asText());

Review comment (edgao, PR author): there's no longer any reason to assert the raw records, so only assert the final record.

  dumpRawTableRecords(streamId),
- 5,
+ 6,

@@ -270,7 +272,7 @@ public void fullRefreshAppend() throws Exception {

runSync(catalog, messages2);

- final List<JsonNode> expectedRawRecords2 = readRecords("dat/sync2_expectedrecords_fullrefresh_append_raw.jsonl");
+ final List<JsonNode> expectedRawRecords2 = readRecords("dat/sync2_expectedrecords_append_raw.jsonl");

Review comment (edgao, PR author): there's no longer any difference between append and dedup raw records, so merge their expectedrecords files.

@edgao edgao marked this pull request as ready for review October 17, 2023 23:26
@edgao edgao requested a review from a team as a code owner October 17, 2023 23:26
@edgao edgao marked this pull request as draft October 18, 2023 15:49
@edgao edgao marked this pull request as ready for review October 18, 2023 21:39
edgao commented Oct 23, 2023

Confirmed via manual test that downgrading from this version to current master results in us re-deduping the raw table, i.e. rollback works as expected. (cc @jbfbell)

@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Oct 25, 2023
@edgao edgao enabled auto-merge (squash) October 25, 2023 16:02
@airbyte-oss-build-runner

destination-snowflake test report (commit b0ca13a992) - ✅

⏲️ Total pipeline duration: 09mn08s

Step Result
Build connector tar
Java Connector Unit Tests
Build destination-snowflake docker image for platform(s) linux/amd64
Java Connector Integration Tests
Validate metadata for destination-snowflake
Connector version semver check
Connector version increment check
QA checks

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=destination-snowflake test

@airbyte-oss-build-runner

destination-bigquery test report (commit b0ca13a992) - ✅

⏲️ Total pipeline duration: 07mn16s

Step Result
Build connector tar
Java Connector Unit Tests
Build destination-bigquery docker image for platform(s) linux/amd64
Java Connector Integration Tests
Validate metadata for destination-bigquery
Connector version semver check
Connector version increment check
QA checks

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=destination-bigquery test

@edgao edgao merged commit 148dda1 into master Oct 25, 2023
15 of 17 checks passed
@edgao edgao deleted the edgao/skip_raw_dedup branch October 25, 2023 16:20
Successfully merging this pull request may close these issues.

T&D Should not dedupe raw table (and maintain current CDC delete behavior)
5 participants